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Abstract 

With increasing use of digital control it is natural to view control inputs and out- 
puts as stochastic processes assuming values over finite alphabets rather than in a 
Euclidean space. As control over networks becomes increasingly common, data com- 
pression by reducing the size of the input and output alphabets without losing the 
fidelity of representation becomes relevant. This requires us to define a notion of dis- 
tance between two stochastic processes assuming values in distinct sets, possibly of 
different cardinalities. If the two processes are i.i.d., then the problem becomes one 
of defining a metric between two probability distributions over distinct finite sets of 
possibly different cardinalities. This is the problem addressed in the present paper. A 
metric is defined in terms of a joint distribution on the product of the two sets, which 
has the two given distributions as its marginals, and has minimum entropy. Computing 
the metric exactly turns out to be NP-hard. Therefore an efficient greedy algorithm 
is presented for finding an upper bound on the distance. We then study the problem 
of optimal order reduction in the metric defined here. This problem also turns out 
to be NP-hard, so again a greedy algorithm is constructed for finding a suboptimal 
reduced order approximation. Taken together, all the results presented here permit 
the approximation of an i.i.d. process over a set of large cardinality by another i.i.d. 
process over a set of smaller cardinality. In future work, attempts will be made to 
extend this work to Markov processes over finite sets. 



1 Introduction 
1.1 Motivation 

As digital control increasingly replaces analog control, in many situations it is more logical 
to view various signals as discrete- valued, assuming values over a finite alphabet, rather than 
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signals assuming values in some Euclidean space. Suppose we view a control system as an 
input-output map where the input signal is a sequence {ut} assuming values in some finite 
set U, while the output signal is a sequence {yt} assuming values in another finite set Y. In 
this setting, the problem of order reduction is quite different in nature from the traditional 
order reduction problem, where the emphasis is on reducing the dimension of the (Euclidean) 
state space. For the latter problems, well-estabhshed methods such as balanced truncation, 
optimal Hankel norm reduction etc. are appropriate. However, in the case of discrete-valued 
signals, a different paradigm is required. 

If the system has some element of randomness in it, we should view {{ut,yt)} as a 
stochastic process assuming values in the set U xY} For the purposes of controller design, 
it would be worthwhile to know whether the finely quantized inputs and outputs can be 
replaced by a coarser quantization without losing too much accuracy in the representation. 
Such considerations become particularly germane in the problem of control over networks, 
whereby the plant and controller may be connected only through a noisy channel. This type 
of order reduction would require approximating the original stochastic process by another 
one assuming values in a set of smaller cardinality U' x Y'. The approximation can be 
quantified by defining a metric distance between two stochastic processes assuming values 
in distinct sets (of different cardinalities). So far as the author is aware, no such metric is 
available in the literature. The closest the author has been able to find is a paper by Ornstein 
[15] in the inaugural issue of the Annals of Probability, in which he defines a metric distance 
between two stochastic processes assuming values in a common finite set. Thus the author 
was motivated to study the problem of defining a metric distance between two stochastic 
processes assuming values in two distinct sets, as a long-term project. 

Naturally, this general problem is very difficult. As a first step, in this paper we tackle 
the problem of defining a metric between two i.i.d. processes {X^} and {Yt} assuming values 
in two distinct sets A and B. Since an i.i.d. process is completely characterized by its 
one-dimensional marginal, the problem now becomes the following: Given two probability 
distributions (f> on K and if) on M, can we define a distance d{(f), if}) that satisfies the usual 
requirements of symmetry and the triangle inequality! Let us suppose we succeed in this 
endeavor. The next logical question would be the optimal order reduction problem, defined 
as follows: Suppose is a probability distribution on a set A of cardinality n, and suppose 
an integer m < n is specified. What is the distribution on a set of cardinality m that is 
closest to in the metric dl This question would be very natural in the context of reduced- 
order modeling of quantized noise, where the assumption that the noise is an i.i.d. process 

^By this we mean that at each instant of time t, {ut, yt) belongs to the set U xY. 
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is not so unrealistic. The optimal order reduction problem therefore consists of optimally 
approximating a quantized noise with a large number of values by another one with far fewer 
possible values. 

Since our motivations include data compression, transmission over noisy channels, etc., 
it is natural to draw upon well-developed techniques from information theory to achieve 
our objectives. The use of information-theoretic methods in the controls community has a 
long history, going back at least to [16] if not much earlier. In this paper, we first define a 
metric distance between two distributions on distinct finite sets by maximizing their mutual 
information. Second, we show that, given a distribution (f), all optimal reduced-order ap- 
proximations of (f> must be aggregations of </>, that is, obtained by adding together various 
components of 0. It turns out that actually computing the metric distance between two 
probability distributions, and computing the optimal reduced order approximation, are both 
NP-hard problems, because they can both be reduced to nonstandard bin-packing problems. 
Therefore we develop efficient greedy algorithms for both problems. Specifically, we can 
compute an upper bound on the distance in 0((n + m^) logm) operations where n and m 
are the cardinalities of the two sets with n > m, and we can compute a suboptimal reduced- 
order approximation in O(nlogm) operations. Note that both complexities are linear in the 
size of the larger distribution. 

As stated above, we view this paper as just the first step in a program. It may be 
possible to extend the approach proposed here to define a metric distance between two 
Markov processes assuming values in two distinct sets, and to approximate optimally a 
Markov process with a large state space by another with a much smaller state space. Those 
are all questions for future research. 

1.2 A General Observation 

If (p, if) are both probability distributions on a common (finite) set A, then a popular and 
meaningful metric between them is the total variation metric defined by^ 

^) = ma^ \<f>{S) - iP{S)\ = 1 J] 10, - ^|;,\. 

However, if i/j are probability distributions on two different sets A, B of different cardi- 
nalities, there is no obvious way to define a notion of distance between them. To illustrate, 
suppose A = {T, F} to suggest 'True' and 'False', while B = {He, Ta} to suggest 'Head' and 
'Tails'. Suppose </> = [0.3 0.7] and t/j = [0.8 0.2] are probability distributions on A. Then 

^The KuUback-Leibler divergence is neither symmetric nor does it satisfy a triangle inequality. Hence it 
is not a metric, though people often refer to it as a metric or a distance. 
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they are clearly very far apart. Since the set A consists only of abstract symbols, we can 
permute the order of the symbols and write A = {F, T}. In this case the probability distribu- 
tions i/? would also get permuted to <^ = [0.7 0.3] and ijj = [0.2 0.8] respectively, and they 
are still far apart. In other words, the total variation metric has the natural property that 
it is permutation-invariant, provided the same permutation is applied to the components of 
both distributions. Now suppose </> is a distribution on A while ■0 is a distribution on B. 
Are they close or are they far apart? If we were to make the association T ■<-> He, F ■<-> To, 
then they are very far apart, whereas if we were to make the association T ■<-)■ To, F ■<-> He, 
then they are very close. Note that, since the two sets A and B are distinct, even if they 
have the same cardinality, neither association is more natural than the other one. As a 
result, whatever metric we define between probability distributions on distinct sets (even of 
the same cardinality), it must be permutation- invariant even if two different permutations 
are applied to the components of the two distributions. In particular, if we permute the 
components of only one of the two distributions, the distance must remain invariant. Note 
that this is not an artifact of a particular definition. Rather, it is an essential requirement 
that any definition must satisfy. One consequence is that any definition will not reduce to 
the total variation metric even if |A| = |B|. 

1.3 Contributions of the Paper 

In this paper, wc define a distance d{(f), i/j) between two probability distributions 0, if) on 
distinct sets A, B by choosing the joint distribution on A, B respectively, so that has 
marginal distributions (f) and -0 respectively, and also has minimum entropy. This is equiv- 
alent to maximizing the mutual information between the random variables having the two 
distributions. In earlier work [13, 14], a symmetrized version of the mutual information is 
used to define a metric distance, called the 'variation of information' metric, between ran- 
dom variables assuming values in distinct sets. Since our definition is an extension of this 
idea, we too use the same nomenclature and refer to the distance defined here (but between 
probability distributions and not random variables) as the variation of information metric. 

We then proceed to solve the problem of actually computing the distance by finding the 
(or a) with minimum entropy, given </), if). This is facilitated by first proving a principle 
of optimality, which allows us to break down the original problem into smaller and smaller 
problems. Using the principle of optimality, wc show that in the case where m = 2, the 
optimization problem is NP-hard, because it is equivalent to a nonstandard version of the 
bin-packing problem. It is therefore plausible that the problem continues to be NP-hard 
even for m > 3, but we do not explore this issue. Instead we propose a greedy algorithm (for 
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general values of m, not just m — 2) that provides a lower bound on the mutual information, 
and an upper bound on the metric distance between the two distributions. This algorithm 
has complexity 0((n + rn?) logm). 

Finally, once we have a distance measure between two probability distributions on distinct 
sets, it is natural to study the optimal order-reduction problem. Specifically, suppose an n- 
dimensional distribution </) and an integer m < n are specified. The problem is to find a (or 
the) m-dimensional distribution t/? such that the distance i/') is minimal. It is shown 
that all optimal approximations must be aggregations of </), that is, obtained by adding 
together components of 0. Moreover, an aggregation of (f) is an optimal approximation if 
and only if it has maximum entropy amongst all aggregations. It turns out that the problem 
of optimal aggregation is yet another bin-packing problem and thus NP-hard, so we propose 
yet another greedy algorithm for this problem, and also provide a bound on its performance. 
The complexity of this algorithm is O(nlogm). 

2 The Variation of Information Metric 
2.1 Concepts from Information Theory 

We begin with a bit of notation. Throughout the paper, we shall use the symbols A, B, C for 
finite sets of cardinality n, m, / respectively. The symbols X, Y, Z denote random variables 
assuming values in A, B, C respectively. The symbols (/), ijj, ^ denote probability distributions 
on the sets A, B, C respectively. None of these symbols is used in any other way. So in 
particular when we write X, it goes without saying that its probability distribution is (f). 
Though the elements of these sets could be any abstract entities, to avoid notational clutter 
we shall write A = {1, . . . , n} instead of the more precise A = {ai, . . . , a„} etc. Let e denote 
the column vector of all one's, and the subscript denote its dimension. Thus e„ is a column 
vector of n one's. A matrix P G [0, 1]"^^" is said to be stochastic if Pe„ = e^, that is, 
for each row, the sum of all columns equals one. Note that P need not be a square matrix; 
but this definition is consistent with the more familiar usage for square matrices. The set 
of m X n stochastic matrices is denoted by Smxn- If we take the degenerate case of m = 1, 
then the symbol §„ = Sixn denotes the set of nonnegative (row) vectors that add up to one. 
Clearly S„ can be identified with the set A^(A) of all probability distributions on A. 

Suppose X, Y are random variables assuming values in A, B respectively, and let G 
A4(A X B) denote their joint distribution. For each index i between 1 and n, let pi denote 
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the conditional distribution of Y given that X = i. That is 

Pij ~ v^m n ■ 

If Pr{X = i} = 0, then the conditional distribution of Y, conditioned on the impossible event 
X = i, is chosen as its marginal distribution i/j. Note that the matrix P = [pij] belongs to 
Snxm, and the i-th row of P, denoted by p.j, belongs to E>m for each i. If we represent the 
joint distribution of X and F by an n x m matrix = [6ij] where % = Pr{X = ikY = j}, 
then we can write 

P = [Diag((/>)]-i0, (1) 

where Diag((/)) represents the nxn diagonal matrix with 0i, . . . , 0„ as the diagonal elements. 
Suppose we now define Q e S^xn by 

qj,^Pr{X^i\Y^j}. 

Then it is easy to see that the following identities hold: 

e = Diag(</))P = g^Diag(Vj), and Q = [Diag(i/^)]-^P^Diag(</)). (2) 

Now we introduce various concepts from information theory. All the concepts intro- 
duced below are discussed in [6, Chapter 2]. The function /i : [0, 1] — )■ 1R+ is defined by 
h{r) — — rlogr, with the standard convention that h{0) — 0. Note that h is continuously 
differentiable except at r = 0, and that h'{r) — —(1 -l-logr). The symbol H denotes the 
Shannon entropy of a probability distribution. Thus if </> e S„, then 

n n 

H{<P)^-J24>iiog<f>i = J2h{<f>i)- 

i=l i=l 

We shall not make any distinction between the entropy of a probability distribution, and the 
entropy of a random variable having that probability distribution. Thus if X is a random 
variable assuming values in A with probability distribution </>, then we shall use the symbols 
H[(f)) and H{X) interchangeably. 

We define the conditional entropy of Y given X as 

n n m n m 

i=l i=l j=l i=l j=l 

where pj denotes the i-th row of the matrix P. With this definition the identities 

H{X, Y) = H{X) + H{Y\X) = H{Y) + H{X\Y) (3) 
hold. The mutual information between X and Y is defined as 

I{X, Y) = H{X) + H{Y) - H{X, Y) = H{Y) - H{Y\X) = H{X) - H{X\Y). (4) 



6 



2.2 Setting Up the Problem 

Suppose X, Y are random variables assuming values in the sets A, B respectively, with dis- 
tributions <f), ij) respectively. We ask: What is the maximum possible mutual information 
between X and Y? Clearly this is equivalent to asking the question: What is a (or the) 
distribution on A x B that has minimum entropy, while satisfying the boundary conditions 
= 0, = V'? (Here it is obvious that by 0a we mean the marginal distribution of 
on A.) Another way of posing the question is this: Given random variables X and Y, how 
close can we come to making Y a deterministic function of X (and vice versa) by suitably 
selecting their joint distribution? 

Definition 1 Given sets A, B with |A| = n, |B| = m, and given (f) e §„, if) G E>m, define 

W{(f>,^/^):^ min {i/(6>) : 6>a = 0, 6>b = -0}, (5) 

0eX{AxB) 

V{ct>,^l,):=Wici>,^l,)-H {<!>). (6) 

The set of e Al ( A x B) that satisfy the boundary conditions 6^ — (l>,dn — tj^ is 
certainly not empty, because the product distribution (jixif) satisfies this requirement. Since 
the feasible region of 6 is nonempty and compact, and H{0) is a continuous function of 6, 
the minimum in (5) is certainly achieved, and as a result both VF(0, tjj) and F (0, tj^) are 
well-defined. Moreover, it is obvious that 

W{^, 0) = Wi<t>, V'), Vi^jj, 4>) = V{cl>, V^) + H{<t>) - H{^), (7) 

where the second identity follows from (3). Hence W is symmetric in its two arguments, 
whereas V is not; however, V{tl), (j>) and V{(j>, tjj) are related via (7). 

Given two random variables X, Y taking values in A, B and having distributions (/>, if) 
respectively, we see that: VF(</>, ■j/j) is the minimum entropy of the joint random variable 
(X, y), V{(f),^) is the minimum conditional entropy H(Y\X), while i?(i/') — V{(f>,'ip) — 
H{(l)) — V (if), (f)) is the maximum mutual information between X and Y. 

2.3 The Variation of Information Metric 

Finally we come to the definition of the metric. We begin by defining a metric between 
random variables, and then move on to distributions. 

Definition 2 Given two random variables X, Y , the VEiriation of information between 
them is defined as 

viX,Y)^H{X\Y) + H{Y\X). (8) 
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This measure is introduced in [13, 14] where it is referred to as the 'variation of infor- 
mation' metric between random variables. So we retain the same nomenclature, though our 
metric is between probability distributions. By making liberal use of (4), we can derive 
several equivalent expressions for the variation of information. 

v{X,Y) = H{X\Y) + H{Y\X)=H{X) + H{Y)-2I{X,Y) 

= H{X,Y) - I{X,Y) ^2H{X,Y) - H{X) - H{Y). (9) 

Theorem 1 The function v{-,-) satisfies the axioms of a pseudometric. Thus v has the 
properties that for all random variables X,Y,Z, we have v{X,Y) > 0, v{X,Y) — v{Y,X), 
and v(X, Y) < v(X, Z) + v{Y, Z). 

Proof: It is obvious that v{X, Y) > 0, and it follows from (8) that v{X, Y) = v(Y, X). To 
show that f (■, ■) satisfies the triangle inequality, we first prove a one-sided triangle inequality 
by combining basic properties of entropy and conditional entropy. A good reference for 
these details is [6]. Note that some of these inequalities also hold over infinite sets, but 
such generality is not required in the present paper. Suppose X, Y, Z are random variables, 
assuming values in finite sets A, B, C respectively. First, by definition we have 

H{X, Z) = H{Z) + H{X\Z). 

This identity remains valid if everything is conditioned on Y . Thus 

H{X, Z\Y) = H{Z\Y) + H{{X\Z)\Y). 

Then we have the law of iterated conditioning which states that 

H{{X\Z)\Y) = H{X\Y,Z). 

In other words, whether we condition X simultaneously on the pair (F, Z), or first condition 
X on Z and then condition the resulting X\Z on Y , the resulting conditional entropies are 
the same. Next, 

H{X\Y,Z) < H{X\Y). 

In other words, conditioning on more variables cannot increase entropy. Finally, 

H{X,Z\Y) > H{X\Y). 

In other words, the entropy of the pair H{X, Z) is no smaller than the entropy of the single 
random variable H{X), and this statement remains valid even if everything is conditioned 
on Y. All of these arguments can be combined into the following chain of reasoning: 

H(X\Y) < H{X,Z\Y)^H(Z\Y) + H(X\Z,Y) 

< H{Z\Y) + H(X\Z). (10) 
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To prove the triangle inequality, invoke the one-sided triangle inequality (10) and observe 
that 



v{X,Y) 



= H{X\Y) + H{Y\X) 

< H{X\Z) + H{Z\Y) + H{Y\Z) + H{Z\X) 

= v{X,Z)+v{Y,Z). 



This completes the proof. 



Now wc turn the above pscudometric between random variables into a pseudometric 
between probability distributions. 

Definition 3 Given two probability distributions </> e S„,i/? e the veiriation of infor- 
mation metric between them is defined as 



Also, it is clear from the definition that d{(f), if)) is the minimum value of the variation 
of information v{X, Y) over all random variables X, Y with marginal distributions </>, ij) 
respectively. 

Theorem 2 The function d defined in (11) is a pseudometric in that it is nonnegative, 
symmetric and satisfies the triangle inequality. 

Proof: It is obvious that d is nonnegative and symmetric; so it only remains to prove 
the triangle inequality. To prove this, we first establish a small technical point. Suppose 
T7 e A1(A X C), C e A1(B X C) and that ry^ = Cc = ^- Then it is always possible to find a 
distribution i/ e A1(A x B x C) such that vkxC = and vmxC — C- In words, the claim is 
that, given two joint distributions, one of X and Z, and another of Y and Z, both of them 
having the same marginal distribution for Z, it is possible to find a joint distribution for all 
three variables X, Y, Z such that the marginal distributions of {X, Y) and of (F, Z) match 
the two given joint distributions. There is no claim that such a triple distribution is unique 
- only that such a distribution exists. To establish the claim, we construct u by making X 
and Y conditionally independent given Z, or equivalently, by making X ^ Z ^ Y into a 
very short Markov chain. Accordingly, let 



d{4>,i,) = v{ct>,i,) + v{i,,4>). 



(11) 



An alternate expression for d is a ready consequence of (6). 



VikCjk 
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It is routine to verify that v has the required properties, using the identities 

Now we return to the proof that d satisfies the triangle inequahty. Given three different 
probabihty distributions e A1(A), i/> e A1(B), ^ e A1(C), let us choose distributions 
6> e A4(A X B), r/ e >1(A x C) and C e ^(B x C) such that 

6>A = 0, 6>B = -0, - VF(</>, -0), (12) 

^A = 0,^c = ^,^N = m0,O, (13) 
CB = V',Cc-t^(C) = W^(V',0- (14) 

In other words, d,r],C are chosen to be optimal for maximizing the mutual information 
between X and Y, between X and Z, and between Y and Z respectively. As pointed out 
earlier, the minimum in (5) is certainly achieved, so there is no difficulty about this. Now 
choose 1/ to be any distribution on A x B x C such that 

l^AxC = rj, ^MxC = C- (15) 

As shown above, such a choice of u is always possible, though certainly not unique. Depend- 
ing on how we choose u above, there is no reason to assume that i^axB equals 6. Indeed this 
may not even be possible. That is precisely the point. Let X, Y, Z be three random variables 
with the joint distribution u. Then the triangle inequality for the quantity v shows that 

v{X,Y)<v{X,Z)+v{Y,Z). 

The manner in which rj and C were chosen shows that 

viX,Z)=di<l>,^), v{Y,Z) = d{^,i). 

However, an analogous statement about v{X, Y) may not be true. So we note instead that 
d(</>, ■0) is the minimum of v{X, Y) whenever X and Y have distributions 0, £ respectively. 
Hence 

d(</>, il,) < v{X, Y) < v{X, Z) + v{Y, Z) = i) + d{il,, 0, 
which is the desired conclusion. ■ 
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2.4 Properties of the Metric 

We conclude this section with a few simple observations. 



Theorem 3 If (p E ^ (md both have strictly positive elements, then d{(f),i/^) = if 

and only if n = m and 0, i/) are permutations of each other. 

Proof: Prom the definitions, it is clear that d{(f), ijj) — ii and only if V{(f), ■0) = 
y (f)) — 0. In other words, it should be possible to choose a joint distribution 6 such 
that 0^ — 4),O^ — il) and in addition, H{X\Y) — H{Y\X) — where X, Y are the random 
variables associated with ip. Since both </>, tj) are strictly positive, this is possible only if 
the matrices of conditional probabilities P and Q (defined in (1) and its analog) consist of 
only zeros and ones. In turn this is possible (if and) only if n = m and </>, i/) are permutations 
of each other, so that X, Y are deterministic functions of each other. ■ 

Theorem 3 readily implies that the metric d is not a convex function. To see this, let 



Then d{(j), 4>) = 0, and d{(p, ip) = because i/j is a permutation of </>. However, since fi is 
not a permutation of (p, it follows that (i(</), /i) > 0. 

3 Computing the Metric 

3.1 Problem Formulation and Elementary Properties 

Now that we have defined the metric, the next step is to compute it. For this purpose, 
we have one of two options. Given <p G E>n,ip G E>m, we can either compute the function 
W{<p, ip) by minimizing the entropy of the joint distribution, or else compute the function 
V{<p, ip) by minimizing the conditional entropy H(Y\X). Note that if we compute V{<p, ip), 
then V{tl),4>) is automatically determined by (7). Also, minimizing the conditional entropy 
maximizes the mutual information, so we refer to this approach as MMI. For reasons that 
will become later, wc assume that n > m. Clearly there is no loss of generality in doing this. 
The next step is to rcparamctrize the problem, by changing the variable of optimization from 
the joint distribution 6 G S„m to the matrix of conditional probabilities P G Snxm- Thus 
the boundary conditions 6^ = <p,6n = il^ get replaced by (f>P = if). Also, it is clear that, for 
a particular choice of P, the conditional entropy H{Y\X) is given by 



</)=[0.3 0.7],V'=[0.7 0.3], iu = 0.50 + 0.5-0= [0.5 0.5]. 



n 




(16) 



1=1 
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where Pi is the i-th row of P. Moreover, it follows from (3) that if P and Q are related by 
(2), then 

MQ) = .MP) + H{cl>)-Hi^P). (17) 

Finally, it is easy to see that, given (f) e Sn,''/' ^ ^m, the quantity V defined in (6) can also 
be defined equivalently as 

V{4>,^)= nun J^P). (18) 

Since maximizing mutual information is equivalent to minimizing conditional entropy, we 
can now formulate the problem under study precisely. 

MMI Problem: Given </> e S„, i/? e S^, find a P e §nxm that minimizes J<j}{P) subject 
to the boundary condition (pP — ip. 

It is clear that the feasible region for this problem 

J^:={Pe §„xm : 0P = ^A} (19) 

is a polyhedral convex set, because it is defined by a finite number of linear equalities and 
inequahties. Recall that an element of a convex set is said to be an extreme point if it 
cannot be expressed as a nontrivial convex combination of two other points belonging to the 
set. Since T is polyhedral, it has only a finite number of extremal points. 

Theorem 4 Suppose all elements of 4> are strictly positive. Then the solution to the opti- 
mization problem in (17) occurs at an extreme point of T. Thus if P achieves the minimum 
of J(i>{-), then at least one element of P is zero. 

Proof: The objective function J^{-) is strictly concave, because as is well-known, the 
entropy function H{-) is strictly concave, and is just a positive linear combination of n 
strictly concave functions. So if P is not an extreme point of say P = \Q + (1 — X)R 
where A > and Q,R & J-' with Q R, then the strict concavity of Jff, implies that 

J^{P) = J^XQ + (1 - X)R) > XJ^iQ) + (1 - X)J^{R). 

So at least one of J(j,{Q) , J^{R) must be less than J(f,{P). Clearly any matrix P with all 
positive elements is in the interior of S^xm and is thus in the (relative) interior of J- and 
thus is not extreme, and thus cannot be an optimum. 

3.2 A Principle of Optimality 

Given the additive nature of the objective function J</,(i-*), it is hardly surprising that we 
can prove a principle of optimality for this problem. In spite of its simplicity, the principle of 
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optimality is extremely useful, in that it permits us to reduce large problems into a succession 
of smaller problems. 

To state the principle of optimality, we introduce a little bit of notation. Suppose </> e 
§„, -0 G §rn are specified, and that 0i > for all i. Suppose A = {1, . . . , n}, and let A' be 
a nonempty proper subset of A. For notational convenience, suppose A' = {1, . . . , A;} where 
k < n. For G §„, P G S„xm; define 



and note that P' e S^xm, though in general need not belong to S^. After this elaborate 
build-up we can now state the principle of optimality. 

Theorem 5 With all notation as above, suppose 0j > Vi, and suppose that P* minimizes 
J(j){P) subject to the constraint that (f)P = ijj. Define c = (j)'^k > 0, and ijj' = Yli=i 't'iPi — 
(f)'{P*)'. Observe that (l/c)(/)' G Sfc, {^/c)ip' G §m- Then (P*)' minimizes J(f,\P') overSkxm 
subject to the constraint that (l/c)</)'P' — (l/c)i/?'. 

Proof: Note that (P*)' is also a stochastic matrix in that {P*yem = e^. Hence 

^'em = (t>\P*)'em = = c> 0, 

because every component of is positive. Hence if)' is certainly not the zero vector, even 
though some components of could be zero. Thus (l/c)0' G Sfe, (l/c)V'' G Sm, and the 
minimization problem under study is similar to the larger problem. To prove the claim, 
suppose by way of contradiction that there exists another matrix Q' G S^xm such that 

k k 

cj>'Q' = MQ') = E '^^^(q^) < Min') = E ^Mp*i)- 

i=l i=l 

Define in an analogous fashion 



{py = 



and note that, since P* is feasible for the original problem, we have that 

n 

i=k+l 



Pi 



Pfe 



Pfc+1 



Q' 

[py 
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Now 



k n 

i=l i=k+l 

k n 

i=l i=k+l 



while 



<j>Q^<f>'Q'+ ^ 0,p* = t/;' + - V'' = V'- 

i=k+l 

Hence Q is feasible for the original problem and has a lower objective function, which is a 
contradiction. Hence (P*)' is a minimizer of the reduced-size problem. ■ 

4 Solution to the MMI Problem in the n x 2 Case 
4.1 The 2x2 Case 

In this subsection we give an explicit closed-form expression for V{4>, if)) when n = m = 2 
and (f),if) & E>2- Without loss of generality, assume that 7^ -0 because V{<p,ip) = if 
(f) — il). Also, again without loss of generality, rearrange the elements of 0, t/? such that both 
vectors are in strictly increasing order. ^ Then we can distinguish between two cases, namely 
(a) < Vi < 01 < 02 < -02, and (b) < 0i < -01 < -02 < 02- 

Theorem 6 Suppose n = m = 2 and cj),xl) G S2. Suppose further that we have either case 
(a) or case (b) above. // < ■01 < 0i < 02 < '02; then 

V{(j>, tA) = 0i//(v2i) = -01 [(V'lM) log(V'i/0i) + ((1 - V'i/0i) log(l - Vi/0i)] , (20) 

where 

V21 - [ ^ i - ^ 

7/0 < 01 < V'l < V'2 < 02 then 

V{(f>, -0) = 02/7(Vi2) = -02 [(1 - V'2/02) l0g(l - V2/02) + (V'2/02) log(V'2/02)] , (21) 

where 



Vl2 



<t>2 4>2 



^To avoid unnecessary pedantry, we assume that lots of strict inequalities hold. The modifications needed 
to handle the case where some of the inequalities are not strict are easy and are left to the reader. 
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Proof: Prom Theorem 4, we know that any optimal choice of P e §2x2 must be an 
extreme point of the feasible region. Thus at least one component of P must be zero. The 
constraints that P is stochastic and that (t)P = if) lead to the following four possible extreme 
points of the feasible region. 



21 








1 


'ijn 


1 


_ 






4>2 - 


- 


1 


tpl - 






01 







1 



,Pl2 = 



1 
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4>2 4>2 







Moreover, since '02 > 02 and ■02 > 0i, it follows that P12 and P22 are infeasible, and the only 
possibilities are Pn and P2i- So all we need to do is to compute J</,(Pii), J(j,{P2i), and pick 
the one that is smaller. 

Accordingly, let us define 

Vu = J^Pn) = 02[/i(V'i/02) + h{l - (V'lM))], 

V21 = MP21) = 0i[/i(^i/0i) + h{l - (V'lM))]. 

Equation (20) is established if it can be shown that V21 < Vu. This is just an exercise in 
elementary calculus. Let us replace ipi by u, and define 

/„(0) := 0[/i(w/0) + hil - (22) 

The desired conclusion follows if it can be shown that for a fixed u > 0, the function 
!->■ /u(0) is strictly increasing when > it. Note that 



M) = 



U <p — U ^ 

- log - + — — log 

U <p — U 



— U log (j) — U log U+ {(j) — u) log — (0 — m) log(0 — u) 

— log — (0 — m) log(0 — U) — U log U 

= -h{4)) + h{4)-u) + h{u). 



Note that h'{x) = -{1 + logx). So 
dfu 



1 + Iog0- 1 -log(0-'u) =log- 

acp (p — u 



> if > 



So Vu = /v,i(02) > 1^22 = UA<t>l) if 02 > 01 > -01- 

Now suppose we are in case (b). Since the optimal choice of is symmetric in (f) and ■0, 
we can apply the above formula with the roles of (f) and 1/? interchanged (which would put it 
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in case (a)). Then we interchange the symbols of </), i/) and permute rows and columns in a 
symmetric fashion. This leads to the conclusion that, in case (b), the optimal choice of is 



01 
^2 - i>2 02 



d = 

and leads to the formula (21). ■ 
4.2 The n X 2 Case 

We begin with a notion that is encountered again several times in the paper. 

Definition 4 Given G S„, i/? e with n > m, ip is said to be an aggregation of </> if 
there exists a partition of A into m sets Ii, . . . ,1^ such that X^jg/^. <l>i — V'j foi^ j — 1, ■ ■ ■ , 

Next we introduce a nonstandard bin-packing problem which is also encountered several 
times later on. The standard bin-packing problem with bin capacity of one is as follows: 
Given a list where < 1 Vi, the objective is to find the smallest integer / for 

which there exists a partition of {1, . . . , n} into / sets Ii, . . . ,Ii such that XI je/ — ^ 
j = 1, . . . ,1. Thus the objective is to pack the given list into as few bins as possible. This 
problem is known to be NP-hard. Now consider a variant of the problem that is relevant 
to the present situation. Given (f) G S„,i/' ^ with n > m, think of ipi, . . . ^ipm as the 
capacities of m bins, and of 0i, . . . ,0„ as the list to be packed into these m bins. Since 
Yli^'i = '^j'^j ~ there are only two possibilities: First, if) is an aggregation of 0, in 
which case all bins can be filled up exactly. Second, ij) is not an aggregation of 0, in which 
case some bins are overstuffed, while the rest have unutilized capacity. The bin-packing 
problem with overstuffing and variable bin capacities is as follows: Find a partition of A into 
m sets Ii, . . . ,Ira such that the total mismatch 



jei 



is as small as possible. Note that we can also define the total unutilized capacity and the 
total overstuffing as 

jeB \ ieij J ^ jeM \ ieij 

In these definitions we use the standard convention that = max{x, 0} and = 

min{x, 0}. Then it is clear that OS = UC and that MI = 20S = 2UC. So the problems 



16 



of minimizing the total mismatch, or total overstuffing, or total unutilized capacity, are all 
equivalent. Unfortunately, this problem is also NP-hard [7]. Even determining whether 
a given ?/> is an aggregation of a given </> or not is also NP-hard. The bin packing with 
overstuffing is discussed in [7, 3, 4] among other papers. 

With this background, we now present a partial solution to the problem of computing 
V{(f), V') when m = 2 in terms of the bin-packing problem with overstuffing with two bins. 
If if) is an aggregation of </>, then obviously V{(l),il)) ~ 0. Otherwise, let ipi,ip2 denote 
the capacity of the two bins, and let denote the list to be packed. Without 

loss of generality, assume that the (pi are in decreasing order of magnitude. Let M11H2 
denote an optimal partition oi H — {l,...,n} and let c denote the minimum unutilized 
capacity. Again, without loss of generality, assume that bin 1 is underutilized and that bin 
2 is overstuffed. This means that 

V'2 - J] 0i = -^1 + 5Z 0^ = c. (23) 

Theorem 7 Suppose if) is not an aggregation of 4>, and solve the bin-packing problem as 
above. If n & M2, then an optimal choice of P that minimizes J(i,{P) subject to (pP — 1/) is 
given by 

Pi = [1 0] e Ml, p^ = [0 1] yzeM2\ {n}, Pn = [ c/<Pn {<Pn - C) / <Pn ]• (24) 

Moreover 

y(0,7/;) = 0„//(p„) = /e(0„), 
where the function f is defined in (22). 

Proof: Prom the principle of optimality, we know that if a matrix P is optimal for the 
n X 2 problem, then every 2x2 submatrix is optimal for its respective problem, and thus 
has at most one strictly positive row. Taken together this shows that any optimal choice of 
P has at most one strictly positive row, while the rest are either [1 0] or [0 1]. Accordingly, 
define P as above, and let R be another matrix that has exactly one strictly positive row 
such that 4>R = ip. All we need to do is to show that J(j,{R) > J(i,{P). For this purpose, 
suppose the k-th row of R is strictly positive, and define 

/i = {^:r, = [l 0]},/2 = 0:r, = [0 1]}, 

while Tk is strictly positive. Then (f)R — imphes that 

ui := -01 - ^ 0i > 0, U2 := iJ2 - ^<Pi > 0, Ui + U2 = (pk, 
ieii ieh 
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Tfc = [ M2/0fe ], Jcj>{,R) = (f>kH{rk) = /Mi(0fc), 

where the function / is defined in (22), and we use the fact that U2 = (f)k — Ui. The fact 
that c is the optimal unutiUzed capacity imphes that c < min{Mi, -^2}. This is because if we 
add (j)k to bin 1, then bin 2 has an unutihzed capacity of ^2, whence U2 > c, and similarly 
ui > c. To put it another way, we have that c < mm{ui,U2} < max{ui,U2} < 4>k — c- In 
turn this implies that 

H{rk) = H{[ ui/<j)k U2/<j)k ]) > H{[ c/<Pk {<Pk - c)/<Pk ])■ 
So we now conclude that 

J^{R) = (pkHivk) > (pkHil c/cpk {(pk - c)/(f)k ]) 

= fc{<f>k)>fc{<f>n) = MP) 

because < 0^ and /c(-) is a strictly increasing function. ■ 

Example 1 Theorem 7 applies only when the last index n belongs to the overstuffed bin. In 
case n belongs to the understuffed bin, one might be tempted to modify the matrix P of (24) 
as follows: Let k* be the smallest index in the set M2, and let 

Pi = [1 0] e M, Pi = [0 1] e A/'2 \ {k*}, p^* = [ c/0fc. (0^. - c)/<Pk* ]• 

Unfortunately such a choice of P is not always optimal, as shown here. Let n — 5, and 

ip^[QA 0.6 ],</>=[ 0.50 0.24 0.12 0.071 0.069 ]. 

Then the optimal packing is given by 

M = {2,4,5},A/'2 = {l,3},c = 0.02, 

and the smallest element in A2 is 4>3 = 0.12. Accordingly let us define k* = 3, and 

To 1 0.02/0.12 1 1 1* 
[1 0.10/0.12 0' 

Then (using the natural logarithm to compute entropy) 

H{ps) = 0.4506, J4P) = hHips) = 5.4067 x 10"^. 
// we now swap ^3 into bin 1 and 04, ^5 into bin 2 so that 

A/-; = {2,3},A/-^ = {1,4,5}, 
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then bin 2 is overstuffed by 0.04 > 0.02 as expected. Since 05 is now the smallest element in 
bin 2, let us define 

P'-- 



1 1 0.040/0.069 '* 



1 1 0.029/0.069 

Then 

H{p'^) = 0.6804, J^(P') = 05//(P5) = 4.6947 x 10"' < J^(P). 
So P is not an optimal choice. 

5 Solution to the MMI Problem in the n x m Case 
5.1 Greedy Algorithm for the MMI Problem 

In general, determining whether t/) is an aggregation of <p, or finding the optimal bin allo- 
cations allowing overstuffing, are both NP-hard problems [3, 4]. It follows that computing 
1^(0,'?/'), or equivalently, computing the maximum mutual information, is also NP-hard 
when m = 2. It is therefore plausible that the problem of computing V{<p,ip) continues to 
be NP-hard if 3 < m < n. But we do not explore this issue further. Instead, we borrow a 
standard greedy algorithm for bin-packing with overstuffing from the computer science liter- 
ature [19], known as 'best fit,' and adapt it to the current situation. We begin by arranging 
the elements of ■0 in descending order. In general it is not necessary to sort the elements of 
<f>. 

Given e S„, i/? e with m < n, proceed as follows: 

1. Set s — 1, where s is the round counter. Define Ug — n^mg — (f)^ — 4>,i/}g — 

2. Place each element of (p in the bin with the largest unused capacity. If a particular 
component does not fit into any bin, assign the index i to an overfiow index set 
Kg. 

3. When all elements of (f)g have been processed, let Ii\...,Iml be the indices from 
{1, . . . jUg} that have been assigned to the various bins, and let Kg denote the set of 
indices that cannot be assigned to any bin. If jX^I > 1 go to Step 4; otherwise go to 
Step 5. 

4. Define a^i \ . . . , ami to be the unutilized capacities of the rris bins, and define a*^^-* = 
[cti ^ . . . ami]- Then the total unutilized capacity Cg := a*^^^e,„^ satisfies 

rris 

^s = J^C.f = Y.(^s)r- (25) 
j=l ieKs 
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Since each G Kg does not fit into any bin, it is clear that {(t>s)i > otj i Vi,j. In 

turn this imphes that \Kg\ < nis- Next, set Ug+i = ms,ms+i = \Ks\, and define 

Increment the counter and go to Step 2. 

5. When this step is reached, \Ks\ is either zero or one. If \Ks\ = 0, then it means that 
ipg is a perfect aggregation of (p^. So define = and proceed as below. If \Ks\ — 1, 
then only one element of 0^, call it {4>s)k, cannot be packed into any bin, and this 
component must equal c^. So let 

V, = ia(^) e ^m.,Vg = CgH{vg), Ug^Vg + H{<t>g) - H{il,,). 

Define e §„,xm. by 

Pj = bj if i e /j'\pfc = Vs, 

where hj is the j-th unit vector with components. Then Vs is the minimum value 
of J(j,^{-)-i and P; achieves that minimum. Next, define e §msxns by 

= [diag(iAJ]-^PjDiag((^J. 

Then it follows from (17) that Qs minimizes Jtp^{-), and that Us is the value of that 
minimum. 

6. In this step, we invert all of the above steps by transposing Qs+i, applying the trans- 
formation in (2), and embedding the resulting matrix into P,. We also correct the 
cost function using (17). Decrement the counter s and recall that = ris+i- Recall 
the unutilized capacity Cs defined in (25) which has been found during the forward 
iteration, and define 

Vs = CsUs+i, Us = Vs + H{ct>,) - H{il,g). 

Define P^ e Sn.xm, by 

(s) 

Pi — hj a i e Ij ,Pi — i-ih row of Qs+i- 
If s = 1, halt; otherwise repeat the step. 

The computational complexity of algorithm is easy to bound. The first step is to sort the 
elements of which has complexity O(m^) if we insist on an exact answer or O(mlogm) 
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if we use a randomized algorithm like quick sort. We use the latter bound here. In each 
step of the best fit algorithm, the bin in which the current element of </) has been placed 
has maximum capacity before placing, but necessarily after placing. So it needs to moved 
into the right place. Since the rest of the bins are still in descending order of capacity, 
this can be achieved in O(logm) steps using a bisection search. And this has to be done 
n times. So once if) is sorted, one run of the best fit algorithm has complexity O(nlogm), 
which dominates the complexity O(mlogm) of sorting t/j, since m < n. Since the size of 
the problem decreases at each round, at worst we may have to run the best fit algorithm 
m — 1 times. Moreover, after the first round, the size of the problem is not any larger 
than m X (m — 1). So the overall complexity of the greedy algorithm is no worse than 
O(nlogm) + mO{m\ogm) = 0((n + m^) logm). The fact that the complexity is only linear 
in n is heartening. 



5.2 Illustrative Example 

The application of the greedy algorithm is illustrated via a large example that needs to go 
through three rounds. 

Example 2 To illustrate the above algorithm, we solve a 40 x 10 problem^ First two uni- 
formly distributed random vectors x e [0,1]''°, y G [0,1]^° were generated using the rand 
command o/Matlab. Then these were stretched out via the transformation 

(pi = exp{xi)/si,ijjj = exp{yj)/s2, 

where Si,S2 (if^^ scaling constants to make the sums come out equal to one. Then only the 
smaller vector is sorted in descending order. The results are shown below. For display 
purposes the resulting (f) is shown as a matrix, though of course it is a row vector. 







0.0304 0.0333 0.0153 0.0335 0.0253 0.0148 

0.0350 0.0353 0.0157 0.0355 0.0350 0.0219 

0.0205 0.0336 0.0297 0.0351 0.0259 0.0139 

0.0265 0.0287 0.0283 0.0199 0.0259 0.0160 

0.0177 0.0141 0.0148 0.0307 0.0270 0.0185 



0.0178 0.0232 

0.0299 0.0155 

0.0314 0.0342 

0.0273 0.0139 

0.0348 0.0139 



il) = [ 0.1241 0.1205 0.1192 0.1139 0.1069 0.0914 0.0875 0.0869 0.0821 0.0675 
By applying the best fit algorithm in round one, we find that 



) = {1, 7, 16, 20, 32}, F^'> = {2, 11, 17, 25, 33}, I^'' = {3, 6, 9, 22, 28}, 



r(i) 



(1) 



*The diary of the example is available upon request from the author. 
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r(i) 



{4, 15, 26, 34}, = {5, 14, 21, 30}, l','> = {8, 18, 31}, l','> = {10, 23, 35, 38}, 



(1) 



r(i) 



r(i) 



{12, 24}, = {13, 27, 40}, I',',' = {19, 29}, K, = {36, 37, 39}. 



r(i) 



The unallocated capacity ci = 0.0924. 
For round 2, we therefore have 

02 = [ 0.1237 0.0721 0.0183 0.0825 0.1934 0.0793 0.0639 0.1856 0.0521 0.1291 ], 



and 



,/,2 = — [ 0(36) 0(37) 0(39) ] = [ 0.3317 0.2917 0.3766 ]. 

^1 



Applying the best fit algorithm to this problem results in 

/f ) = {2, 5}, if = {3, 4, 7}, Jf = {1,6, 9}, K, = {8, 10}. 
The unutilized capacity C2 — 0.3146, and 

03 = [ 0.2103 0.4037 0.3860 ]. 

^ = 1[ 02(8) 02(10) ] = [ 0.5898 0.4102 ], 
For this simple problem an exact solution can be computed using Theorem 7, and is 



0.9691 0.0309 
1.0000 
1.0000 



The matrix Qs is now computed as [Diag('j/'3)] ^ P^Dia.g{(j)^) and equals 

Q3-- 



0.3456 0.6544 

0.0158 0.9842 



As per the algorithm, we have 



Vs = V{(j>„^|^,) = J^^iPs) = M,H{{Ps)i) = 0.0290, 

^73 = ^3 + H{(t>3) - H{ilj,) = 0.4136, 

Now we return to round 2. The rows of constitute rows 8 and 10 of the 10 x 3 matrix 

P2, while all other rows are elementary row vectors. The i-th row of P2 equals the j-th 

(2) 

elementary row vector if the index i belongs to the set / • . Further, 



^2 = J</,,(P2) = C2t/3 = 0.1301, 
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U2 = V2 + H{cf)^) - H{ip2) = 0.1301 + 2.1525 - 1.0932 = 1.1894. 

At last we can get back to the initial round. From the 10 x 3 matrix P2, we generate a 
matrix Q2 € [0, Ij^xio 5y the familiar transformation, which leads to 

0.2174 0.5831 0.1933 0.0062 " 

0.0627 0.2827 0.2191 0.4354 . 

0.3286 0.2105 0.3225 0.1384 

The three rows of Q2 form rows 36, 37 and 39 of the matrix P — Pi, while the other 37 rows 
are elementary vectors. 

Finally we compute the values ofV, U. 

Vi = J4,{P) = C1U2 = 0.1099, 

Ui^Vi + H{(t)i) - H{il}i) = 0.1099 + 3.6399 - 2.2853 = 1.4645. 

What this means is that, with the choice of P as described above, whenever X assumes any 
value other than 36, 37 or 39, Y is a deterministic function of X. With this choice of P, 
we have J(p{P) = 0.1099 as the conditional entropy H(Y\X), a reduction of more than 95% 
from H{Y) = H{il^) = 2.2853. Similarly H{X\Y) = Ui ^ 1.4645. So we conclude that 

d((f),ij^) <Vi + Ui^ 1.5744. 

6 Optimal Order Reduction 

Now that we have a way of measuring the metric distance between two probabihty distribu- 
tions defined on sets of different cardinality, we can examine the problem of approximating 
a distribution e by another 1/? G §,„ where m < n, such that the distance d{<p, 1/)) 
between them is as small as possible. We begin by showing that all optimal reduced-order 
approximations must be aggregations of (p. Then we show that minimizing the distance 
d{(f), if}) is equivalent to maximizing the entropy of the aggregation ij). Then the problem of 
maximizing the entropy of the aggregation is formulated as yet another bin-packing problem, 
this time with equal-sized bins. This problem is solved via a best fit greedy algorithm, and 
an upper bound is presented for the performance of the best fit algorithm (with possibly 
unequal bin sizes). 

6.1 All Optimal Reduced- Order Approximations are Aggregations 

Theorem 8 Suppose </> e §„, ■0 e S^, m < and that 1/) is not an aggregation of (p. Then 
there exists a ip' &E>m such that d{4>, ip') < d{(f), 1/)). 
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The proof of the theorem makes use of a couple of prehminary lemmas. 



Lemma 1 Suppose fj, e W?, 7^ 0. Then 



J2h{f^j)-cH{il/c)t^)) + h{c), (26) 



where c — /ite^ is a normalizing constant. 
Proof: We have that 



m m ^ m /' ™ \ 

j=l j=l 3=1 \j=l J 

in 

= ^ log - - clogc = c//((l/c)iu)) + h{c). 



This completes the proof. 



Lemma 2 Suppose ci, C2, 6 > with ci + C2 + 6 = 1. For each A e [0, 1], define i/'(A) e §2 
by 

t/;(A)- [ C1 + A6 C2 + (l-A)6 ], 

and G : [0, 1] R &?/ 

G'(A) = hH{[X 1 - A]) - H{il}{X)). 

Then 

G{X) > min{G(0), G(l)} = min{-i7(^(0)), -H{il^{l))}. (27) 

Proof: This follows from elementary calculus. Using the fact that h'{r) — — (1 + logr) 
for all r > 0, it can be shown that G(A) achieves its maximum at A* = ci/(ci + C2) and is 
decreasing on either side of it. So if A < A*, then G{X) > G{0), whereas if A > A*, then 
G{X) > G{1). In either case, (27) is satisfied. ■ 

Proof (of Theorem 8): Suppose </> e §„, 1/? e m < n, and ■0 is not an aggregation 
of (j). Choose P e Snxm such that (f)P — ■0 and J4>{P) — V{(f),'ip). Since ij) is not an 
aggregation of 4>, at least one row of P contains at least two nonzero (i.e., positive) elements. 
Let k be such a row, and without loss of generality permute the components of in such a 
way that pki > 0,Pk2 > 0. To show that 1/) cannot be an optimal approximation of </> in the 
d metric, we will construct another distribution if)' e that matches 1/) from component 
3 onwards. We will do this by perturbing only the two elements Pki,Pk2 in such a way that 
P'ki ^ 'P'ki — Pki + Pk2, and defining 1/)' — 4>P'. This means that many of the quantities are 
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common to and ■j/j', so in the various equations below, we will just write 'const' to avoid 
notational clutter. 

Prom the manner in which P was chosen, it follows that 

n 

V{(t>, 4>) = ^ <PiH{pi) = 0ifc-f/"(pfc) + const 

i=l 

= (f>k[h{pki) + h{pk2)] + const, 

V{^l^,4>) = V{(l>,'tl^) + H{(j>)-H{^j,) 

= MHPki) + h{Pk2)] - IHi'i) + h{ii2)] + const. 

Note that the 'constant' in the two equations need not be the same. Our use of the phrase 
'constant' means only that all the ignored summations remain unchanged when we replace 
tj) hy Proceeding further, let us write 

n 

i=l i^k 

Similarly, 

n 

^2 = X ^'P''^ ^ X ^'P''^ + ^^^^^2 -■d2 + (t>kPk2- 

1=1 i^k 

With these definitions, we can write 

= MKPki) + Kpk2)] 

- [h{di + 0fePfei) + + (t>kPk2)] + const. (28) 

This looks similar to the function G{X) in Lemma 2, except that neither p^i + pfc2 nor 
di + d2 + 4>k necessarily add up to one. So we proceed as in the proof of Lemma 2 and apply 
the correction terms from Lemma 1 wherever necessary. Let us define 

Pkl + Pk2 Pkl + Pk2 

/3 = Pfel + Pk2, a^di + d2 + I3(f)k, 

and note that 

■01 + ■02 = + + /50fe = a. 

With these definitions, and making repeated use of Lemma 1, we get 

MHPki) + Hpk2)] = /3(l>kH{[X 1 - A]) + 0fe/i(/3), 
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where 



Hence 



7 = 



^ + — + {1-X) — 

a a a a 



e S2. 



a 



H{[X 1-X]) + H{j) 



+ (l)kh{/3) — h{a) + const. 



Now the quantity inside the brackets is hke G{X) in Lemma 2, with 

di d2 , ^(t)k 

Ci = — ,C2 = — ,6 = . 

a a a 

And these three numbers do add up to one. So we know from Lemma 2 that the quantity 
inside the brackets can be made smaller by choosing A = or 1. The choice A = causes 
Pfei,Pifc2, V'l, V'2 to be replaced by 

b'fel P'k2] = [0 Pfcl +Pfc2], Wl ^'2] = [dl d2 + (j)k{Pkl + Pk2)], 

while the choice A = 1 causes Pki,Pk2, i^i, V'2 to be replaced by 

\Pki P'k2\ = bfci +Pk2 0], [i/j'i ^2] = [dl + MPki + Pk2) d2]. 

In either case the numbers (3, a remain the same, whence the correction term (pkh{(5) — h{a) 
also remains the same. So decreasing the quantity inside the brackets reduces V{il), (p). So 
the conclusion is that there exists a P' G S„xm such that, with i/)' = (f)P\ we have 

J4P') + H{(f>)-H{tl,') = <f>k[h{p'ki)+h{p'k2)]-[hm + h{^'2)]+ const 

< MHPki) + h{Pk2)] - [h{ipi) + /i(V'2)] + const 

Now, since V{(f), 1/)') is the minimum of the quantity J(j>{Q) over all Q e §nxm such that 
(f)Q — ij)'^we conclude from the above that 

v(0,v')+^(0)-^(V'')<n^,0), 

or equivalently that 

V{^l,',cf>)<V{tP,(f>). 

Similarly, we can compare V{<p,i/)') and V{(f),'4>). Since p'j^i + p'k2 = Pki + Pk2, and one of 
p'ki,p'k2 is zero, it is obvious that 



MKp'ki) + Kp'k2)] < MKPki) + h(pk2)]- 
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Hence 

.MP')<J4P) = v{(P,^). 

As a consequence, we have as before that 

-0') = min J^(Q) s.t. (tiQ^tl}' < J^P') < V{(t>, t/^). 

Combining both inequahties leads to d{(f), -0') < ■0). ■ 

6.2 Greedy Algorithm for Finding Suboptimal Solution 

Theorem 8 shows that the best reduced-order approximations in Ei^ to the given </> e S„ is 
an aggregation of (j). Now, suppose 0^"^ e is an aggregation of (f). Then it is clear that 
= 0. Hence it follows from (7) that 

Hence an aggregation 0^"-' of (p into m states is an optimal approximation if and only if </>^"^ 
has maximum entropy amongst all aggregations. Note that when m = 2, an aggregation 
has maximum entropy if and only if the total variation U2) is minimized, where 

denotes the uniform distribution with m components. This is a bin-packing problem 
with overstuffing where both bins have capacity 0.5 and is thus NP-hard. It is plausible 
that the problem remains NP-hard for m > 3 as well. So we reformulate the problem. 
Amongst all distributions in S^, the uniform distribution has the maximum entropy. 
Thus we seek an aggregation of <p in such a way that the total variation distance p{<j)^°'\ u^) 
is minimized. This is yet another bin-packing problem with overstuffing, with all bin sizes 
equal to 1/m. A suboptimal aggregation can therefore be found using the best fit algorithm, 
whose complexity is O(nlogm) as discussed earlier. 

Therefore when m = 2 the optimal-reduced order approximation problem is NP-hard, so 
we are justified in seeking an efficient suboptimal algorithm even when m >2. 

We complete this section with an upper bound on the performance of the best fit algo- 
rithm, without assuming that the bin sizes are equal. This result may be of independent 
interest. By specializing to the case where ■0 = u^, this bound can be translated into a 
bound on the entropy of the resulting aggregation. The details are easy and left to the 
reader. 

Theorem 9 For the best fit algorithm with bin size vector 1/), we have 

p((/>('^),V') <0.25m</.^ax- (29) 

where ^max = maxj{0j}. 
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Proof: The steps in the proof follow the corresponding steps in [18] . Once the greedy 
algorithm is completed, let us denote the resulting aggregation 0*-"^ by a to reduce clutter. 
Let us refer to bin j as 'heayy' if aj > ipj, and 'light' if aj < ipj. Suppose there are k heavy 
bins. Without loss of generality, renumber the bins such that the first k bins are heavy and 
the rest are light. Let Ci, . . . , denote the excess and s^+i, • • • , "Sm denote the slack. In 
other words, 

Cj — Oij — — 1, ■ ■ ■ , k, and Sj — ipj — aj, j — k + 1, . . . ,111. 

For j = 1, . . . , k, let Vj denote its excess capacity just before the last item was placed into it 
(making it heavy). Then two things are obvious. First, rj + cj equals the last component of 
4> that was placed into this bin, and as a result rj + Cj < 0max- Second, the nature of the 
algorithm implies that rj is at least equal to the capacity of all the other bins at the time 
this item was placed into bin j. Since bin capacity can only decrease as the algorithm is run, 
in particular this implies that 

fj ^ 'Sfc+l, • • • , Smti = 1, . . . , /c. 

Therefore 



1 ^ 1 

— > To > mm Tj > max s,- > > s-j. 

k ^ j=i,-,k j=k+i,...,m m - k ^ 

j=l j=k+l 

Rearranging gives 

k m 

{m-k)J2rj ^ ^ X] 

j=l j=k+l 

Since both are unit vectors, it follows that 

k m 



j=l j=k+l 



Therefore 

k m 

(m - k) ^{rj + Cj) > m ^ Sj. 

j=i j=k+i 

Note that the right side is precisely ■mp{cx, 1/)). Hence 

p{a, xl)) < > r J + ej) < 0ma^ < — - — , 

which follows from the obvious observation that k{m — k) < no matter what k is. 
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It is quite easy to show that the performance of the algorithm is bounded by O.Sm^max- 
This is because no bin can be overstuffed by more than 0max, and no bin can have unutihzed 
capacity more than 0max- Since the totals of over- and under-capacity have to balance out, 
the bound O.5m0max follows. Thus the real essence of the theorem is to gain an extra factor 
of 0.5. 

The specific result of [18] bounds the total weight of all the bins (call it A) and shows 
that A < 1.25Aopt. Moreover, the proof also requires an extra assumption that 0max < ipmm, 
something that is not needed here. It is easy to verify that the weight of an algorithm equals 
1 -I- p{4>^"'\ ip) achieved by that algorithm. Hence a direct apphcation of the results of [18] 
would imply that 

p(</>("),^) < 1.25popt(</>^"\V') + 0.25. 

Because of the additive constant of 0.25, this bound is less useful than the bound (29) given 
by Theorem 9. 

6.3 Example of Aggregation Using the Greedy Algorithm 

Example 3 Let us return to the 4 0-dimensional probability distribution (p studied in Exam- 
ple 2. As seen earlier, H{(p) — 3.6399, while the maximum entropy that any 10- dimensional 
distribution can have is log(lO) ~ 2.3026. 

Applying the best fit algorithm for aggregation without sorting (f) results in the following 
grouping and aggregation (shown as a matrix for convenience): 

h = {1, 16, 23, 33}, h = {2, 18, 31}, h = {3, 12, 24, 40}, h = {4, 19, 30, 36}, 

h = {5, 15, 27, 38}, /6{6, 11, 17, 25, 34}, I7 = {7, 13, 26, 37}, h = {8, 14, 22, 28, 35}, 

J9 = {9,20,32,39},/io = {10,21,29}, 

0.0951 0.0942 0.0989 0.1099 0.1020 " 
0.0917 0.1085 0.0938 0.1188 0.0871 ' 

We have that H{(j)^"'^ — 2.2934, quite close to the theoretical maximum 0/ 2.3026. 

In contrast, if we first sort (f) before applying the best fit algorithm, the following grouping 
results: 

h = {1, 20, 21, 39}, h = {2, 19, 22, 40}, h = {3, 18, 23, 38}, h = {4, 17, 24, 37}, 

h = {5, 16, 25, 36}, le = {6, 15, 29, 31}, Ir = {7, 14, 27, 34}, h = {8, 13, 26, 35}, 
/g = {9, 12, 28, 33}, /lo = {10, 11, 30, 32}. 
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The resulting aggregation is 

0.1019 0.1021 0.1016 0.1007 0.1004 " 
0.0982 0.0994 0.0993 0.0982 0.0982 J ' 

which is much closer to being uniform than the earlier aggregation. 

7 Conclusions 

In this paper we have studied the problem of defining a metric distance between two proba- 
bihty distributions over distinct finite sets of possibly different cardinalities. Along the way, 
we have formulated the problem of constructing a joint distribution on the product of the 
two sets, which has the two given distributions as its marginals, in such a way that the joint 
distribution has minimum entropy. While the problem of maximizing mutual information is 
occasionally discussed in the literature, this specific problem does not appear to have been 
studied earlier. This problem turns out to be NP-hard, so we reformulated the problem as 
a bin-packing problem with overstuffing, and adapt the best fit algorithm for bin-packing, 
leading to an upper bound on the distance between the two given distributions. The com- 
plexity of this algorithm is 0{{n + m^) logm), where n is the larger of the two cardinalities 
and m is the smaller. 

Once the metric is defined, we then study the problem of optimal order reduction, namely, 
given an n-dimensional distribution, finding the (or a) m-dimcnsional distribution that is 
closest in the metric to the given distribution. This turns out to be equivalent to aggregating 
the original distribution so as to maximize the entropy of the aggregated distribution. This 
problem is also NP-hard. Accordingly, the problem is again formulated as a bin-packing 
problem with overstuffing. A greedy algorithm with complexity O(nlogm) is presented, and 
an upper bound is derived for the error of the greedy algorithm. 

The work described here is in contrast with earlier work [8, 9, 10, 17] in which the attempt 
is to define a notion of divergence, not necessarily a metric, between two distributions on 
distinct sets. 

There are several important follow-up problems that arise from the work reported here. 
The distance between two probability distributions (f), ip can be thought of as the distance 
rate between two i.i.d. processes whose one-dimensional marginals are these two distributions. 
So the results presented here permit the approximation of an i.i.d. process over a set of large 
cardinality by another i.i.d. process over a set of smaller cardinality. In order to be truly 
useful in the context of control over networks for example, the next logical step is to extend 
the theory to finite-state Markov chains, and finite-state hidden Markov models. We propose 
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to tackle this extension in future research. If we succeed in that, then the next logical step 
would be to extend the work still further to arbitrary stationary ergodic processes, using the 
Shannon- McMillan- Breiman theorem. 
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